PySCNet includes four modules:
1) Pro-precessing: initialize a gnetData object consisting of Expression Matrix, Cell Attributes, Gene Attributes and Network Attributes;
2) BuildNet: reconstruct GRNs by various methods implemented in docker;
3) NetEnrich: network analysis including consensus network detection, gene module identification and trigger path prediction as well as network fusion;
4) Visulization: network illustration.
A python package - STREAM was designed for reconstructing cell trajectory for single cell transcriptomic data. This tutorial guides how to integrate STREAM with pyscnet for gene regulatory network along the cell differential trajectory.
from __future__ import absolute_import
import warnings
warnings.filterwarnings("ignore")
import sys
import os
import itertools
import scanpy as sc
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80, facecolor='white')
from pyvis.network import Network
import pandas as pd
import copy
import numpy as np
import matplotlib.pyplot as plt
from pyscnet.Preprocessing import gnetdata
from pyscnet.BuildNet import gne_dockercaller as gdocker
from pyscnet.BuildNet import gne_synchrony as gs
from pyscnet.NetEnrich import graph_toolkit as gt
pd.set_option('display.max_rows', 1000)
Data was obtained from Nestorowa, S. et al. The cell trajectory was build according to STREAM Tutorial. As STREAM is also built on AnnData structure, it can be directly imported into pyscnet.
#the file is too large
# import _pickle as pk
# with open('data/stream_adata.pk', 'rb') as input:
# adata = pk.load(input)
#user jupyter-notebook in stream env: ~/miniconda3/envs/stream/bin/
import stream as st
adata = st.read('data/stream_adata.pklz')
As explained in STREAM Tutorial, new attributes including cell pseudotime at different branches are added into obs of AnnData. It gives us the hints of ordering cells by pseudotime and estimating gene dynamics along cell trajectory.
adata.var_names
As shown below, 6 cell sub-populations form a Bifurcation trajectory starting from HSC expanding into two directions. One is MEP branch and the other one is GMP branch. Additionally, leaf markers and transition markers are detected for each branch. It guides us for feature selection
st.plot_stream_sc(adata,root='S3',color=['label'],
dist_scale=0.1,show_graph=True,show_text=True)
# save_fig=True, fig_path='result/cell_branches')
#leaf_markers of cell pathway: S1 -> S2
adata.uns['leaf_markers'][('S1','S2')].head()
st.plot_stream(adata,root='S3',color=['Gata1','Klf1'])
#Import stream anndata into pyscnet
stream_gne = gnetdata.load_from_scanpy(adata)
In the following, cells from S1 to S2 and positive leaf markers of (S1->S2) will be selected for GRN construction. As shown below, positive leaf markers are only highly expressed in cells of S1 -> S2. There are 236 genes and 535 cells selected for downstream analysis.
#select cells from S1 -> S2
cell_info = stream_gne.CellAttrs['CellInfo'].sort_values('S1_pseudotime', ascending=True)
cell = list(cell_info.loc[cell_info['branch_id_alias'].isin([('S2', 'S1')])].index)
len(cell)
#select positive leaf markers
feature = adata.uns['leaf_markers'][('S1', 'S2')]
# feature = feature.sort_values('zscore', ascending=False).head(100).index
feature = feature[feature.zscore > 0].index
# with open('result/S0_S1_leafMarker.txt', 'w') as file:
# file.writelines(["%s\n" % item for item in feature])
As reported in this publication, GRNs specific for MEP (S1->S2) and LMPP(S1->S3) were built based on pre-selected 31 features. In the following, cells from above branches will be individually tested for gene correlation.
feature = ['Bptf', 'Cbfa2t3', 'Erg', 'Ets1', 'Ets2', 'Etv6', 'Fli1', 'Gata1', 'Gata2', 'Gata3', 'Gfi1b', 'Hhex',
'Hoxa5', 'Hoxa9', 'Hoxb4', 'Ikzf1', 'Ldb1', 'Lmo2', 'Lyl1', 'Meis1', 'Mitf', 'Myb', 'Nfe2', 'Nkx2-3',
'Notch1', 'Pbx1', 'Prdm16', 'Runx1', 'Smarcc1', 'Tal1', 'Tcf7']
In order to evaluate the performance of different GRN methods, reference linkage was download from string PPI filtered by high confidence score > 0.7 and PPI enrichment p-value < 1.0e-16. There are 420 edges left among the 117 features. 119 isolated ndoes without connecting to any others were removed for GRN construction.
import networkx as nx
#leaf markers
Ref_edges = pd.read_csv('/home/mwu/Downloads/string_interactions_2.tsv',
sep='\t').iloc[:,:2]
#
Ref_edges = pd.read_csv('/home/mwu/Desktop/refNetwork.csv',
sep=',').iloc[:,:2]
Ref_edges.columns = ['source', 'target']
Ref_edges['weight'] = 1
G = nx.from_pandas_edgelist(Ref_edges)
feature = list(set(feature) & set(G.nodes))
len(feature)
# Set window size to compute moving window synchrony.
r_window_size = 50
subExpr = stream_gne.ExpMatrix.loc[feature, cell]
df = subExpr.T
genes = ['Mki67', 'Top2a']
# Interpolate missing data.
df_interpolated = df.interpolate()
# Compute rolling window synchrony
rolling_r = df_interpolated[genes[0]].rolling(window=r_window_size, center=True).corr(df_interpolated[genes[1]])
f,ax=plt.subplots(2,1,figsize=(14,6),sharex=True)
df[genes].rolling(window=r_window_size,center=True).mean().plot(ax=ax[0])
ax[0].set(xlabel='Frame',ylabel='Expression level')
rolling_r.plot(ax=ax[1])
ax[1].set(xlabel='Frame',ylabel='Pearson r')
plt.suptitle("pairwise gene and rolling window correlation")
As reported in GRN benchmark paper, although some tools were designed for estimating gene regulatory relationship from transcriptomic data, only three methods (GENIE3, PIDC and GRNBOOST2) were considered as competitive. Besides, two additional methods (window_rolling and phase synchrony) are also provided in the pyscnet.
stream_gne = gs.get_synchrony(stream_gne.deepcopy, method='window_rolling',
feature=feature, cell=cell)
stream_gne = gs.get_synchrony(stream_gne.deepcopy, method='phase_synchrony',
feature=feature, cell=cell)
#GRNs for MEP branch
stream_gne = gdocker.rundocker(stream_gne.deepcopy, method='GENIE3', directed=False,
feature=feature, cell=cell)
stream_gne = gdocker.rundocker(stream_gne.deepcopy, method='PIDC', directed=False,
feature=feature, cell=cell)
stream_gne = gdocker.rundocker(stream_gne.deepcopy, method='CORR', directed=False,
feature=feature, cell=cell)
stream_gne = gdocker.rundocker(stream_gne.deepcopy, method='GRNBOOST2', directed=False,
feature=feature, cell=cell)
To quality the confidence of GRNs obtained from differen methods, top 420 edges were selected from each methods. ROC and PR curves were ultilized for methods evaluation.
Ref_edges = pd.read_csv('/home/mwu/Downloads/string_interactions_2.tsv',
sep='\t').iloc[:,:2]
Ref_edges.columns = ['source', 'target']
Ref_edges['weight'] = 1
Ref_edges.shape
#remove self-loop edges
index_list=list()
for i in range(Ref_edges.shape[0]):
if Ref_edges.iloc[i]['source'] == Ref_edges.iloc[i]['target']:
index_list.append(i)
Ref_edges=Ref_edges.drop(index_list)
#As edges are undirected, the following is to create Full reference links (117*116/2) and then merge with the string links
from itertools import product, permutations, combinations, combinations_with_replacement
Full_Ref = pd.DataFrame(combinations(set(feature), 2), columns=['source', 'target'])
Full_Ref['weight'] = 0
Ref_edges['name'] = Ref_edges['source'] + '_' + Ref_edges['target']
Full_Ref['name'] = Full_Ref['source']+ '_' + Full_Ref['target']
i = 0
for name in Ref_edges.name:
tmp = list(name.split('_'))
if name in list(Full_Ref['name']):
Full_Ref.loc[Full_Ref.name == name, 'weight'] = 1
elif tmp[1] + '_' + tmp[0] in list(Full_Ref['name']):
Full_Ref.loc[Full_Ref.name == tmp[1] + '_' + tmp[0], 'weight'] = 1
else:
i = i +1
Full_Ref[Full_Ref.weight == 1].shape
As shown below, phase synchrony method outperforms others in terms of top 420 edges.
import alg_evaluation
dict_of_algs = dict()
link_keys = list(filter(lambda x:'_links' in x, stream_gne.NetAttrs.keys()))
for link in link_keys:
method = link.split('_links')[0]
top = stream_gne.NetAttrs[link].sort_values('weight', ascending=False).head(420)
dict_of_algs[method] = alg_evaluation.evaluation(top, copy.deepcopy(Full_Ref))
fig = plt.figure(figsize=(20, 8))
grid = plt.GridSpec(1, 2, wspace=0.2, hspace=0.2)
ax1 = fig.add_subplot(grid[0, 0])
ax2 = fig.add_subplot(grid[0, 1])
filename='test'
alg_evaluation.build_curves(ax1, dict_of_algs, 'ROC', filename, linewidth=3)
alg_evaluation.build_curves(ax2, dict_of_algs, 'PCR', filename, linewidth=3)
fig.savefig('result/test_consensus_confidence.pdf')
NetEnrich module integretes graph traversal techniques to expolre gene regulatory network. It includes Breadth-first search, Depth-first search and Supervised random walk. In this way, gene trigger paths indicating genes with hidden associations can be predicted.
#build graph
from pyscnet.NetEnrich import graph_toolkit as gt
stream_gne = gt.buildnet(stream_gne, key_links='phase_synchrony_links', top=420)
#calculate node centrality
stream_gne = gt.get_centrality(stream_gne)
#detect gene community
stream_gne = gt.detect_community(stream_gne)
We take node Mki67 as a starting point and perform a 5 steps random walk guided by node centrality (degree). It generates a random path (Mki67-Rrm1-Rrm2-Mcm5-Mcm7-Pa2g4). Superisely, three co-regulation edges (Mki67-Rrm2, Mki67-Mcm5, Mki67-Mcm7) missed in prediction were detected via random walk.
# Breadth-first search to explore gene network
# bfs_path = list(gt.graph_traveral(stream_gne.NetAttrs['graph'], start='Mki67', threshold=1, method='dfs'))
# bfs_path
# #Supervised random walk
random_path = gt.random_walk(stream_gne, start = 'Mki67', supervisedby='degree', steps=5)
random_path
import graph_tool.all as gta
import matplotlib.pyplot as plt
from pyscnet.Plotting import _nx2gt as nx2gt
#select top 420 edges
tmp = stream_gne.NetAttrs['phase_synchrony_links'].sort_values('weight', ascending=False).head(420)
net1 = Ref_edges.loc[(Ref_edges.target=='Mki67') | (Ref_edges.source == 'Mki67')]
net2 = tmp.loc[(tmp.source=='Mki67') | (tmp.target=='Mki67')]
net3 = tmp.loc[(tmp.source.isin(['Mki67'])) | (tmp.target.isin(['Mki67', 'Rrm1', 'Rrm2', 'Mcm5', 'Mcm7', 'Pa2g4']))]
plt.switch_backend("cairo")
fig, ax = plt.subplots(1, 2, figsize=(20, 8))
G1 = nx2gt.nx2gt(nx.from_pandas_edgelist(net1, edge_attr='weight'))
pos = gta.fruchterman_reingold_layout(G1)
edge_color = G1.new_edge_property("string")
labels = dict(zip(list(range(G1.num_vertices())), list(G1.vertex_properties['id'])))
gta.graph_draw(G1, vertex_text=G1.vertex_properties['id'], vertex_text_position=-2,
vertex_size=0.8, vertex_font_size=0.2, mplfig=ax[0])
ax[0].set_title('Edges connected to Mki67 in StringPPI')
G1 = nx2gt.nx2gt(nx.from_pandas_edgelist(net2, edge_attr='weight'))
pos = gta.fruchterman_reingold_layout(G1)
edge_color = G1.new_edge_property("string")
labels = dict(zip(list(range(G1.num_vertices())), list(G1.vertex_properties['id'])))
gta.graph_draw(G1, vertex_text=G1.vertex_properties['id'], vertex_text_position=-2,
vertex_size=0.8, vertex_font_size=0.2, mplfig=ax[1])
ax[1].set_title('Edges connected to Mki67 in prediction')
fig, ax = plt.subplots(1, figsize=(20, 8))
G1 = nx2gt.nx2gt(nx.from_pandas_edgelist(net3, edge_attr='weight'))
pos = gta.fruchterman_reingold_layout(G1)
edge_color = G1.new_edge_property("string")
labels = dict(zip(list(range(G1.num_vertices())), list(G1.vertex_properties['id'])))
highlight_path=random_path
for e in G1.edges():
if (highlight_path is not None) and (labels[list(e)[0]] in highlight_path) and (
labels[list(e)[1]] in highlight_path) and abs(highlight_path.index(labels[list(e)[0]]) - highlight_path.index(labels[list(e)[1]])) == 1:
edge_color[e] = 'darkorange'
else:
edge_color[e] = 'lightgrey'
pos = gta.arf_layout(G1)
gta.graph_draw(G1, vertex_text=G1.vertex_properties['id'], vertex_text_position=-2,
edge_color=edge_color, vertex_size=1.2, vertex_font_size=0.3, mplfig=ax)
ax.set_title('5 steps random walk starting from Mki67')
Plotting module principlely provides a set of functions for visualization.
from pyscnet.Plotting import net_plot as npl
plt.switch_backend("cairo")
fig, ax = plt.subplots(1, 2, figsize=(40, 18))
# npl.net_matrix_plot(stream_gne, output_size=(500, 500), vertex_text_position=-2, vertex_font_size=5)
npl.net_hierarchy_plot(stream_gne, vertex_text_position=-2, vertex_size=0.2, vertex_font_size=0.05, mplfig=ax[0])
npl.net_matrix_plot(stream_gne, vertex_text_position=-2, vertex_size=0.8, vertex_font_size=0.2, mplfig=ax[1])
from pyscnet.Plotting import __geneDance_plot as gp
cg = gp.geneHeatmap(stream_gne, gene=['Mki67', 'Top2a', 'Ccnb2'], figsize=[15,1.5],
cell_clusterid=('S2', 'S1'), select_by='branch_id_alias', yticklabels=False,
order_by='S1_pseudotime', col_cluster=False, cmap='RdBu_r')
cg.ax_row_dendrogram.set_visible(False)
# #gene correlation heatmap
gp.geneCorrelation(stream_gne, gene=feature,
cell_clusterid=('S2', 'S1'), select_by='branch_id_alias',
order_by='S1_pseudotime', xticklabels=False, yticklabels=False,
save_as='result/gene_correlation.pdf', figsize=[5, 5])
#save stream_gne as pickle
# stream_gne.save_as('data/pyscnet_stream.pk')
stream_gne = gnetdata.load_Gnetdata_object('data/pyscnet_stream.pk')
import openpyxl
tmp = adata.uns['leaf_markers'][('S1', 'S2')].loc[feature]
tmp.to_excel('result/selected_features.xlsx')